Abstract

We address the online linear optimization problem with bandit feedback. Our contribution
is twofold. First, we provide an algorithm (based on exponential weights) with a regret of
order $\sqrt{dn \log N}$ for any finite action set with $N$ actions, under the assumption that the
instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous
works, and gives a regret bound of order $d\sqrt{n \log n}$ for any compact set of actions. Without
further assumptions on the action set, this last bound is minimax optimal up to a logarithmic
factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization
with expert advice in dimension $d$ is the same as for the basic $d$-armed bandit with expert
advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain
computationally efficient strategies with minimax optimal regret bounds in specific examples.
More precisely we study two canonical action sets: the hypercube and the Euclidean ball. In
the former case, we obtain the first computationally efficient algorithm with a $d\sqrt{n}$ regret, thus
improving by a factor $\sqrt{d \log n}$ over the best known result for a computationally efficient
algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{dn \log n}$ regret, again
shaving off an extraneous $\sqrt{d}$ compared to previous works.

1 Introduction



In this paper we consider the framework of online linear optimization: at each time instance
$t = 1, \dots, n$, the player chooses, possibly in a randomized way, an action from a given compact
action set $A \subset \mathbb{R}^d$. The action chosen by the player at time $t$ is denoted by $a_t \in A$.
Simultaneously with the player, the adversary chooses a loss vector $z_t \in Z \subset \mathbb{R}^d$, and the loss
incurred by the player is $a_t^\top z_t$. The goal of the player is to minimize the expected cumulative loss
$E \sum_{t=1}^n a_t^\top z_t$, where the expectation is taken with respect to the player's internal randomization
(and possibly the adversary's randomization). In the basic version of this problem, the player observes
the adversary's move $z_t$ at the end of round $t$. We consider here the bandit version, where the player only
observes the incurred loss $a_t^\top z_t$. As a measure of performance we define the regret of the player as

$$R_n = E \sum_{t=1}^n a_t^\top z_t - \min_{a \in A} E \sum_{t=1}^n a^\top z_t.$$

In this paper we are interested in the dual setting, where the adversary plays on a dual action set,
i.e., $A$ and $Z$ are such that $|a^\top z| \le 1$ for all $(a, z) \in A \times Z$.

1.1 Contributions and relation to previous works 

In the full information case, the online optimization setting (for convex losses) was introduced by
Zinkevich [2003]. The specific online linear optimization problem with bandit feedback was first
studied by McMahan and Blum [2004] and Awerbuch and Kleinberg [2004]. Our first contribution
to this problem is to complete the research program started by Dani et al. [2008] and
Cesa-Bianchi and Lugosi [2011]. In these papers the authors studied the exp2 (Expanded Exp)
algorithm, also called Geometric Hedge, Expanded Hedge, or ComBand. This strategy applies to a
finite set of actions; it assigns an exponential weight to each action, and then draws an action at
random from the corresponding probability distribution. Using a basic estimation procedure (first
used by Auer et al. [2002] for the basic multi-armed bandit problem), one can estimate the loss
vector $z_t$. However, to control the range of the estimates, one has to mix the probability given by
exp2 with an "exploration distribution". Dani et al. [2008] chose this distribution to be uniform
over a barycentric spanner for the action set, while in [Cesa-Bianchi and Lugosi, 2011] the
distribution was uniform over all actions. Using ideas from convex geometry, we propose a new
distribution that allows us to derive a minimax optimal regret bound. More precisely, we show that
for any finite action set with $N$ actions, exp2 with the exploration distribution given by John's
Theorem (see Theorem 3) attains a regret of order $\sqrt{dn \log N}$. This improves by a factor
$\sqrt{d}$ over previous works. Moreover this rate is optimal: there exist action sets (such as the
hypercube) where the minimax rate is of order $d\sqrt{n}$, see [Dani et al., 2008]. Surprisingly, this
result also shows that exp2 with John's exploration can be used for linear bandits with $N$ experts
to obtain a regret of order $\sqrt{dn \log N}$, which is no worse than the minimax regret for the basic
$d$-armed bandit with $N$ experts problem.

While these results show that, without further assumptions on the action set, the regret of
exp2 is optimal, they do not say anything about optimality for a specific set of actions. In
fact, it was proven by Audibert et al. [2011] that for some pair $(A, Z)$ exponential weights
is a provably suboptimal strategy (with a gap of order $\sqrt{d}$). To address this issue, another class
of algorithms has been studied for online optimization: the Mirror Descent style algorithms of
Nemirovski and Yudin [1983]. This class of algorithms was rediscovered in the learning community,
see for example Kivinen and Warmuth [2001]. In recent years the number of papers using
Mirror Descent to solve problems in online optimization has been growing very rapidly. In the
full information setting (when one observes $z_t$), we have a very good understanding of how to use
Mirror Descent to obtain optimal regret bounds that adapt to the geometry of the problem, see
[Rakhlin, 2009; Hazan, 2011; Bubeck, 2011]. In particular, a recent paper suggests that in this
basic setting Mirror Descent is "universal", see [Srebro et al., 2011]. On the other hand, in the
limited feedback scenario the picture is much more scattered. In the particular cases of semi-bandit
feedback (see [Audibert et al., 2011]) and two-point bandit feedback (see [Agarwal et al., 2010]),
we know how to use Mirror Descent to obtain optimal regret bounds. However, in both scenarios
the feedback is much stronger than in the more fundamental bandit problem. In this latter
case, there is only one paper that successfully applies Mirror Descent, namely the seminal work
of Abernethy et al. [2008]; see also the follow-up paper Abernethy and Rakhlin [2009]. Unfortunately,
for a convex and compact set $A$, this approach (which combines Mirror Descent with a
self-concordant barrier for the action set) leads to a regret bound of order $d\sqrt{\theta n \log n}$ for any
$\theta > 0$ such that $A$ admits a $\theta$-self-concordant barrier. For example, in the case of the hypercube
the best we know is $\theta = O(d)$, which results in the suboptimal $d^{3/2}\sqrt{n \log n}$ regret (compared to
$d\sqrt{n}$ for exp2 with John's ellipsoid). However, note that in this particular case it is not known if
exp2 can be implemented efficiently, while Mirror Descent runs in polynomial time.

Our second main contribution is to propose an efficient algorithm based on Mirror Descent,
with an optimal regret bound for two canonical pairs $(A, Z)$. Namely, the (hypercube,
cross-polytope) pair, which corresponds to an $L_\infty/L_1$ type of constraints, and the (Euclidean ball,
Euclidean ball) pair, which corresponds to an $L_2/L_2$ constraint. In the former case this results in the
first computationally efficient algorithm with a regret of order $d\sqrt{n}$, while in the latter case it is the
first efficient algorithm with a regret of order $\sqrt{dn \log n}$. Indeed, the approach of Abernethy et al.
[2008] only gives $d\sqrt{n \log n}$ for the pair (Euclidean ball, Euclidean ball) since there exists an
$O(1)$-self-concordant barrier for the Euclidean ball. Note also that this specific example was studied in
Abernethy and Rakhlin [2009]; we discuss their result in Section 5.

1.2 Outline of the paper 

The paper is organized as follows. In Section 2 we introduce the two algorithms discussed in 
the paper: Expanded Exp (exp2) and Online Stochastic Mirror Descent (OSMD). In both cases 
we state a general regret bound. In Section 3 we detail our exploration strategy for exp2, and 
show the corresponding regret bound. We also discuss briefly the extension to linear bandits with 
expert advice. Then in Section 4 (respectively Section 5) we show how to use OSMD to obtain 
a computationally efficient strategy with optimal regret for the hypercube (respectively for the 
Euclidean ball, up to a logarithmic factor). 

2 Algorithms 

We briefly describe here the two algorithmic templates that we shall use in this paper. First, exp2
is described in Figure 1. The general regret bound for this algorithm is stated in Theorem 1; its
proof follows a standard argument, see for example [Bubeck, 2011, Chapter 7].

Algorithm: exp2 with exploration $\mu$.

Parameters: learning rate $\eta$; mixing coefficient $\gamma$; distribution $\mu$ over the action set $A$.
Let $q_1 = (1/|A|, \dots, 1/|A|) \in \mathbb{R}^{|A|}$. For each round $t = 1, 2, \dots, n$:

(a) Let $p_t = (1 - \gamma) q_t + \gamma \mu$, and play $a_t \sim p_t$.

(b) Estimate the loss vector $z_t$ by $\tilde{z}_t = P_t^{+} a_t a_t^\top z_t$, with $P_t = E_{a \sim p_t}[a a^\top]$.

(c) Update the exponential weights, for all $a \in A$,

$$q_{t+1}(a) = \frac{\exp(-\eta a^\top \tilde{z}_t)\, q_t(a)}{\sum_{b \in A} \exp(-\eta b^\top \tilde{z}_t)\, q_t(b)}.$$

Figure 1: exp2 strategy for bandit feedback.
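
As an illustration of steps (a) and (c) of Figure 1, the following is a minimal sketch in plain Python. The function names `mix` and `exp_weights_update`, the toy action set, and the numerical values are assumptions made for this example only; the loss estimate of step (b) is taken as given rather than computed.

```python
import math

def mix(q, mu, gamma):
    """Step (a): p_t = (1 - gamma) * q_t + gamma * mu."""
    return [(1.0 - gamma) * qi + gamma * mi for qi, mi in zip(q, mu)]

def exp_weights_update(q, actions, z_est, eta):
    """Step (c): reweight each action by exp(-eta * <a, z_est>) and renormalize."""
    scores = [sum(ai * zi for ai, zi in zip(a, z_est)) for a in actions]
    shift = min(scores)  # a common shift cancels in the normalization
    w = [qi * math.exp(-eta * (s - shift)) for qi, s in zip(q, scores)]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical toy instance: three actions in the plane.
actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
q1 = [1.0 / len(actions)] * len(actions)
p1 = mix(q1, [0.5, 0.5, 0.0], gamma=0.1)
q2 = exp_weights_update(q1, actions, z_est=(2.0, -1.0), eta=0.5)
```

Actions with a smaller estimated loss receive a larger weight after the update, and both `p1` and `q2` remain probability vectors.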

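Theorem 1 Let $A$ be a finite set of $N$ actions. For the exp2 strategy, provided that
$\eta |a^\top \tilde{z}_t| \le 1$ for all $a \in A$, one has

$$R_n \le 2\gamma n + \frac{\log N}{\eta} + \eta\, E \sum_{t=1}^n \sum_{a \in A} p_t(a) \left(a^\top \tilde{z}_t\right)^2.$$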

Figure 2 describes OSMD in the bandit setting. Note that step (c) can be written in several equivalent
ways, such as a Follow The Regularized Leader equation, or a mirror gradient descent step if $F$ is
a Legendre function. When written as a gradient descent step, one usually has to project back on $A$
(using the Bregman divergence associated to $F$). Here the projection is implicit in the evaluation
of $\nabla F^*$. The following theorem states a general regret bound for OSMD. Recall that the Bregman
divergence with respect to $F$ is defined as $D_F(x, y) = F(x) - F(y) - (x - y)^\top \nabla F(y)$, and the
Legendre-Fenchel dual of $F$ is defined as $F^*(v) = \sup_{x \in A} x^\top v - F(x)$. In the following, we write
$x_{1:t}$ to denote $x_1 + \dots + x_t$.
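
To make the definition of the Bregman divergence concrete, here is a minimal sketch in plain Python. The helper names (`bregman`, `half_sq_norm`, `grad_half_sq_norm`) and the choice of the squared Euclidean norm as the test function are assumptions made for illustration; for that function the divergence reduces to half the squared distance.

```python
def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - (x - y)^T grad F(y)."""
    g = grad_F(y)
    return F(x) - F(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

def half_sq_norm(x):
    """F(x) = 0.5 * ||x||^2, used here only as a simple test function."""
    return 0.5 * sum(xi * xi for xi in x)

def grad_half_sq_norm(x):
    """Gradient of 0.5 * ||x||^2 is x itself."""
    return list(x)

# For this F, D_F(x, y) = 0.5 * ||x - y||^2.
d_val = bregman(half_sq_norm, grad_half_sq_norm, [1.0, 2.0], [0.0, -1.0])
```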

Theorem 2 Let $A$ be a compact set of actions, and $F$ a function with effective domain $A$, and
such that $F^*$ is differentiable on $\mathbb{R}^d$. Then OSMD satisfies (for any norm $\|\cdot\|$)

$$R_n \le \frac{\sup_{a \in A} F(a) - F(a_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n E\, D_{F^*}\!\left(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}\right) + \sum_{t=1}^n E \left\| a_t - E[\tilde{a}_t \mid a_t] \right\| \cdot \|z_t\|_* .$$

Algorithm: OSMD.

Parameters: learning rate $\eta > 0$; regularization function $F : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ with effective
domain $A$, and such that the Legendre-Fenchel dual $F^*$ is differentiable on $\mathbb{R}^d$; perturbation
scheme for step (a) below.

Let $a_1 \in \arg\min_{a \in A} F(a)$. For each round $t = 1, 2, \dots, n$:

(a) Play $\tilde{a}_t$ at random from some probability distribution $p_t$ over $A$
($\tilde{a}_t$ is a randomly perturbed version of $a_t$, see Section 4 and Section 5 for examples).

(b) Estimate the loss vector $z_t$ by $\tilde{z}_t = P_t^{+} \tilde{a}_t \tilde{a}_t^\top z_t$, with $P_t = E_{a \sim p_t}[a a^\top]$.

(c) Let $a_{t+1} = \nabla F^*\!\left(-\eta \sum_{s=1}^t \tilde{z}_s\right)$.

Figure 2: Online Stochastic Mirror Descent (OSMD) for bandit feedback.

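Proof The proof is adapted from Kakade et al. [2010]. Using Young's inequality, one obtains, for all $a \in A$,

$$\begin{aligned}
-\eta \sum_{t=1}^n a^\top \tilde{z}_t &\le F(a) + F^*(-\eta \tilde{z}_{1:n}) \\
&= F(a) + F^*(0) + \sum_{t=1}^n \left( F^*(-\eta \tilde{z}_{1:t}) - F^*(-\eta \tilde{z}_{1:t-1}) \right) \\
&= F(a) + F^*(0) + \sum_{t=1}^n \left( \nabla F^*(-\eta \tilde{z}_{1:t-1})^\top (-\eta \tilde{z}_t) + D_{F^*}(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}) \right) \\
&= F(a) + F^*(0) + \sum_{t=1}^n \left( -\eta a_t^\top \tilde{z}_t + D_{F^*}(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}) \right).
\end{aligned}$$

Since $F^*(0) = -F(a_1)$, this shows that:

$$\sum_{t=1}^n (a_t - a)^\top \tilde{z}_t \le \frac{F(a) - F(a_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{F^*}(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}).$$

Taking into account the randomness induced by $\tilde{a}_t$ and $\tilde{z}_t$, the rest of the argument is then an
easy exercise, see for example [Bubeck, 2011, Chapter 7]. ■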

This theorem proves to be particularly useful when applied with a Legendre function $F$; see
[Cesa-Bianchi and Lugosi, 2006, Chapter 11] for the definition of a Legendre function. Indeed,
in that case $F^*$ is differentiable if $F$ is differentiable, and moreover the corresponding gradient
mappings are inverses of each other, which gives a simple way to do computations with the Bregman
divergence $D_{F^*}$.

3 exp2 with John's exploration

We propose here a new exploration distribution $\mu$ for the exp2 strategy, that allows us to derive
the first $\sqrt{dn \log N}$ regret bound for online linear optimization with bandit feedback. We use the
following result from convex geometry, see [Ball, 1997] for a proof.

Theorem 3 Let $K \subset \mathbb{R}^d$ be a convex set. If the ellipsoid $\mathcal{E}$ of minimal volume enclosing $K$ is the
unit ball in some norm derived from a scalar product $\langle \cdot, \cdot \rangle$, then there exist $M \le d(d+1)/2 + 1$
contact points $u_1, \dots, u_M$ between $\mathcal{E}$ and $K$, and $\mu \in \Delta_M$ (the simplex of dimension $M - 1$), such
that

$$x = d \sum_{i=1}^M \mu_i \langle x, u_i \rangle u_i, \quad \forall x \in \mathbb{R}^d.$$

To use this theorem, we need to perform a preprocessing of the action set as follows: 

1. First, we assume that $A$ is of full rank (that is, linear combinations of elements of $A$ span $\mathbb{R}^d$).
If it is not the case, then one can rewrite the elements of $A$ in some lower dimensional vector
space and work there.

2. Find John's ellipsoid for $\mathrm{Conv}(A)$, i.e., the ellipsoid of minimal volume enclosing $\mathrm{Conv}(A)$:
$\mathcal{E} = \{x \in \mathbb{R}^d : (x - x_0)^\top H^{-1} (x - x_0) \le 1\}$. The first preprocessing step is to translate
everything by $x_0$. In other words, we assume now that $A$ is such that $x_0 = 0$. Furthermore,
we define the inner product $\langle x, y \rangle = x^\top H y$.

3. We can now assume that we are playing on $A' = H^{-1} A$, and the loss of playing $a' \in A'$
when the adversary plays $z$ is $\langle a', z \rangle$. Indeed: $\langle H^{-1} a, z \rangle = a^\top z$. Moreover, note that John's
ellipsoid for $\mathrm{Conv}(A')$ is the unit ball for the inner product $\langle \cdot, \cdot \rangle$ because
$\langle H^{-1} x, H^{-1} x \rangle = x^\top H^{-1} x$.

4. Find the contact points $u_1, \dots, u_M$ and $\mu \in \Delta_M$ that satisfy Theorem 3 for $\mathrm{Conv}(A')$. Note
that the contact points are in $A'$, thus they are valid points to play. We say that $\mu$ is John's
exploration distribution.
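
The decomposition of the identity in Theorem 3 can be checked numerically on a simple case. The sketch below, in plain Python, assumes the standard Euclidean inner product and takes the hypercube rescaled so that its vertices lie on the unit sphere, whose minimal-volume enclosing ellipsoid is the unit ball by symmetry. It uses uniform weights over all $2^d$ vertices, which is more points than the bound on $M$ in Theorem 3 requires; the function name `john_identity_matrix` is hypothetical.

```python
from itertools import product

def john_identity_matrix(points, weights):
    """Return d * sum_i weights[i] * u_i u_i^T, which should equal the identity."""
    d = len(points[0])
    m = [[0.0] * d for _ in range(d)]
    for u, w in zip(points, weights):
        for r in range(d):
            for c in range(d):
                m[r][c] += d * w * u[r] * u[c]
    return m

d = 3
scale = d ** -0.5
# Vertices of the hypercube, rescaled to have unit Euclidean norm.
vertices = [tuple(s * scale for s in signs) for signs in product((-1.0, 1.0), repeat=d)]
uniform = [1.0 / len(vertices)] * len(vertices)
m = john_identity_matrix(vertices, uniform)
```

Since `m` is the identity matrix, applying it to any $x$ returns $x$, which is the identity stated in Theorem 3 for this choice of points and weights.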

In the following we drop the prime on $A'$. More precisely, we play on a set $A$ such that John's
ellipsoid for $\mathrm{Conv}(A)$ is the unit ball for some inner product $\langle \cdot, \cdot \rangle$, and the loss is given by $\langle a, z \rangle$.
Thus, we also need to slightly change the algorithm to account for the fact that the loss is now an
arbitrary scalar product. Step (c) in Figure 1 is modified as:

$$q_{t+1}(a) = \frac{\exp(-\eta \langle a, \tilde{z}_t \rangle)\, q_t(a)}{\sum_{b \in A} \exp(-\eta \langle b, \tilde{z}_t \rangle)\, q_t(b)}.$$

We also modify the loss estimate given by step (b) as follows. Recall that the outer product $u \otimes u$
is defined as the linear mapping from $\mathbb{R}^d$ to $\mathbb{R}^d$ such that $u \otimes u (x) = \langle u, x \rangle u$. Note that one can
also view $u \otimes u$ as a $d \times d$ matrix, so that the evaluation of $u \otimes u$ is equivalent to a multiplication
by the corresponding matrix. Now let:

$$P_t = \sum_{a \in A} p_t(a)\, a \otimes a.$$

Note that this matrix is invertible, since $A$ is of full rank and $p_t(a) > 0$ for all $a \in A$. The estimate for
$z_t$ is given by:

$$\tilde{z}_t = P_t^{-1} (a_t \otimes a_t) z_t. \quad (1)$$

Note that this is a valid estimate since $(a_t \otimes a_t) z_t = \langle a_t, z_t \rangle a_t$ and $P_t^{-1}$ are observed quantities.
Moreover, it is also clearly an unbiased estimate. We can now prove the following result. 
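
As a quick sanity check of the unbiasedness of estimate (1), the following sketch enumerates a small sampling distribution exactly. It is written in plain Python for $d = 2$ and assumes the standard Euclidean inner product (so that $a \otimes a = a a^\top$); the helper names and the numerical instance are hypothetical.

```python
def outer(a):
    """2x2 matrix a a^T."""
    return [[a[0] * a[0], a[0] * a[1]], [a[1] * a[0], a[1] * a[1]]]

def inv2(m):
    """Inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matvec(m, x):
    return [m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1]]

def expected_estimate(actions, p, z):
    """E[P^{-1} (a a^T) z] for a drawn from p, computed by exact enumeration."""
    P = [[sum(pi * outer(a)[r][c] for a, pi in zip(actions, p)) for c in range(2)]
         for r in range(2)]
    P_inv = inv2(P)
    mean = [0.0, 0.0]
    for a, pi in zip(actions, p):
        loss = a[0] * z[0] + a[1] * z[1]  # the only feedback the player observes
        est = matvec(P_inv, [loss * a[0], loss * a[1]])
        mean = [mean[0] + pi * est[0], mean[1] + pi * est[1]]
    return mean

actions = [(1.0, 0.0), (0.0, 1.0), (0.6, -0.8)]
p = [0.5, 0.3, 0.2]
z = (0.3, -0.7)
mean_est = expected_estimate(actions, p, z)
```

The averaged estimate `mean_est` coincides with `z` up to rounding, because the expectation of $a_t \otimes a_t$ under $p_t$ is exactly $P_t$.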

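Theorem 4 exp2 with John's exploration and estimate (1) satisfies, for $\frac{\eta d}{\gamma} \le 1$,

$$R_n \le 2\gamma n + \frac{\log N}{\eta} + \eta n d.$$

In particular with $\gamma = \eta d$ and $\eta = \sqrt{\frac{\log N}{3nd}}$ we have that

$$R_n \le 2\sqrt{3nd \log N}.$$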
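
The tuning in Theorem 4 can be verified by direct arithmetic: with $\gamma = \eta d$ the bound $2\gamma n + \frac{\log N}{\eta} + \eta n d$ becomes $3\eta n d + \frac{\log N}{\eta}$, which is minimized at $\eta = \sqrt{\log N / (3nd)}$. The sketch below (plain Python, hypothetical function name and example values) evaluates the bound at this tuning.

```python
import math

def exp2_john_tuning(n, d, num_actions):
    """Return (eta, gamma, bound) for the tuning gamma = eta * d of Theorem 4."""
    log_n_actions = math.log(num_actions)
    eta = math.sqrt(log_n_actions / (3.0 * n * d))
    gamma = eta * d
    bound = 2.0 * gamma * n + log_n_actions / eta + eta * n * d
    return eta, gamma, bound

# Hypothetical example values: horizon 10000, dimension 5, 32 actions.
eta, gamma, bound = exp2_john_tuning(n=10_000, d=5, num_actions=32)
closed_form = 2.0 * math.sqrt(3.0 * 10_000 * 5 * math.log(32))
```

Here `bound` and `closed_form` agree up to rounding, and the constraint $\eta d / \gamma \le 1$ holds with equality.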

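Proof With the chosen scalar product, it is easy to see that the condition $\eta |a^\top \tilde{z}_t| \le 1$ in Theorem 1
rewrites as $\eta |\langle a, \tilde{z}_t \rangle| \le 1$, while the third term in the regret bound rewrites as
$\eta\, E \sum_{t=1}^n \sum_{a \in A} p_t(a) \langle a, \tilde{z}_t \rangle^2$. Thus it remains to control those two quantities.
Let us start with the latter:

$$\begin{aligned}
\sum_{a \in A} p_t(a) \langle a, \tilde{z}_t \rangle^2 &= \sum_{a \in A} p_t(a) \langle \tilde{z}_t, (a \otimes a) \tilde{z}_t \rangle \\
&= \langle \tilde{z}_t, P_t \tilde{z}_t \rangle = \langle a_t, z_t \rangle^2 \langle P_t^{-1} a_t, P_t P_t^{-1} a_t \rangle \le \langle P_t^{-1} a_t, a_t \rangle.
\end{aligned}$$

Now we use a spectral decomposition of $P_t$ in an orthonormal basis for $\langle \cdot, \cdot \rangle$ and write
$P_t = \sum_{i=1}^d \lambda_i v_i \otimes v_i$. In particular, we have $P_t^{-1} = \sum_{i=1}^d \frac{1}{\lambda_i} v_i \otimes v_i$ and thus:

$$E \langle P_t^{-1} a_t, a_t \rangle = \sum_{i=1}^d \frac{1}{\lambda_i} E \langle (v_i \otimes v_i) a_t, a_t \rangle = \sum_{i=1}^d \frac{1}{\lambda_i} E \langle (a_t \otimes a_t) v_i, v_i \rangle = \sum_{i=1}^d \frac{1}{\lambda_i} \langle P_t v_i, v_i \rangle = d.$$

This concludes the bound for $E \sum_{a \in A} p_t(a) \langle a, \tilde{z}_t \rangle^2$. We turn now to $\langle a, \tilde{z}_t \rangle$:

$$|\langle a, \tilde{z}_t \rangle| = |\langle a_t, z_t \rangle \langle a, P_t^{-1} a_t \rangle| \le |\langle a, P_t^{-1} a_t \rangle| \le \frac{1}{\min_{1 \le i \le d} \lambda_i},$$

where the last inequality follows from the fact that $\langle a, a \rangle \le 1$ for any $a \in A$, since $A$ is included
in the unit ball. Now to conclude the proof we need to lower bound the smallest eigenvalue of $P_t$.
Using Theorem 3, one can see that $P_t \succeq \frac{\gamma}{d} I_d$, and thus $\lambda_i \ge \frac{\gamma}{d}$, concluding the proof. ■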

Using the discretization argument of Dani et al. [2008], exp2 with John's exploration can be used
to obtain a regret of order $d\sqrt{n \log n}$ for any compact set of actions $A$.



3.1 Computational issues 

If $A$ is given by a finite set of points, then Grötschel et al. [1993] give a polynomial time algorithm
for computing a constant factor approximation to John's ellipsoid (and this approximate basis
will provide the same order of regret). However, if $A$ is specified by the intersection of half spaces,
then Nemirovski [2007] shows that obtaining such a constant factor approximation to this ellipsoid
is NP-hard in general. Here, it is possible to efficiently compute an ellipsoid where the factor of
$d$ in Theorem 3 is replaced by $d^{3/2}$ (see [Grötschel et al., 1993]), which leads to a slightly worse
dependence on $d$ in the regret bound.

In special cases, we conjecture that John's ellipsoid may be computed efficiently, since for
certain problems there are efficient implementations of GeometricHedge that lead to optimal rates
(such as shortest path problems and other settings where dynamic programming solutions exist).

3.2 Application to bandits with experts 

Consider the following model of linear bandits with $N$ experts. At each time step $t = 1, 2, \dots, n$,
each expert $k = 1, \dots, N$ suggests an action $a_t(k) \in \mathbb{R}^d$. The goal here is to compete with the
best expert, that is, at each time step the strategy chooses an expert $k_t \in \{1, \dots, N\}$ and the regret
is given by:

$$R_n^{\mathrm{exp}} = E \sum_{t=1}^n a_t(k_t)^\top z_t - \min_{k \in \{1, \dots, N\}} E \sum_{t=1}^n a_t(k)^\top z_t.$$

One can use exp2 with John's exploration to obtain a regret of order $\sqrt{dn \log N}$ for this problem.
Indeed, it suffices at every turn to do the preprocessing step on $A_t = \{a_t(1), \dots, a_t(N)\}$ and to
build the corresponding John's exploration $\mu_t$; the straightforward details are omitted.

For example, at each time $t$ each expert $i = 1, \dots, N$ is associated with a hidden loss estimate
$z_t(i) \in Z$ and an arbitrary "context set" $A_t \subseteq A$ is observed. Each expert $i$ then suggests the best
action according to the current loss estimate, $a_t(i) = \arg\min_{a \in A_t} z_t(i)^\top a$. This can be viewed as
a natural nonstochastic variant of the contextual linear bandit model of Chu et al. [2011]. Another
notable special case is the $d$-armed bandit problem with expert advice, where we can view the
suggested actions as the corners of the $d$-dimensional simplex. Here, the EXP4 algorithm of Auer et al.
[2002] achieves a regret of order $\sqrt{dn \ln N}$. Interestingly, the regret achievable in the more general
$d$-dimensional linear optimization setting is no worse than in the seemingly simpler $d$-armed bandit
with expert advice setting.

4 Computationally efficient strategy for the hypercube 

In this section we restrict our attention to the action set $A = \{x \in \mathbb{R}^d : \|x\|_\infty \le 1\}$. Using exp2
with John's exploration on $\{-1, 1\}^d$ one obtains a regret bound of order $d\sqrt{n}$ for this problem,
and as it was shown by Dani et al. [2008] this regret is minimax optimal. However, it is not known
if it is possible to sample from the exponential weights distribution in polynomial time for this
particular set of actions. In this section we propose to turn to OSMD, and we show that with the
appropriate regularizer $F$ and random perturbation $\tilde{a}_t$ (see step (a) in Figure 2), one can obtain a
minimax optimal algorithm with computational complexity linear in $d$. More precisely we use an
entropic regularizer

$$F(x) = \frac{1}{2} \sum_{i=1}^d \big( (1 + x_i) \log(1 + x_i) + (1 - x_i) \log(1 - x_i) \big), \quad (2)$$

together with the following perturbation of a point $a_t$ in the interior of $A$:

With probability $\gamma$, play $\tilde{a}_t$ uniformly at random from the canonical basis (with random
sign). With probability $1 - \gamma$, play $\tilde{a}_t = \xi_t$ where $\xi_t(i)$ is drawn from a Rademacher
with parameter $\frac{1 + a_t(i)}{2}$ (that is, $\xi_t(i) = 1$ with probability $\frac{1 + a_t(i)}{2}$ and $\xi_t(i) = -1$ otherwise).

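It is easy to check that this perturbation is almost unbiased; indeed one has:

$$E[\tilde{a}_t(i) \mid a_t] = (1 - \gamma) \left( \frac{1 + a_t(i)}{2} - \frac{1 - a_t(i)}{2} \right) = (1 - \gamma)\, a_t(i),$$

and thus:

$$\left\| E[\tilde{a}_t \mid a_t] - a_t \right\|_\infty \le \gamma. \quad (3)$$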
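
The perturbation scheme for the hypercube and the bias bound (3) can be sketched as follows in plain Python. The function names and the example point are hypothetical; `conditional_mean` computes the expectation exactly rather than by simulation, using the fact that the exploration branch has zero mean.

```python
import random

def perturb_hypercube(a, gamma, rng):
    """Draw the played point: a signed basis vector w.p. gamma, else independent signs."""
    d = len(a)
    if rng.random() < gamma:
        played = [0.0] * d
        played[rng.randrange(d)] = rng.choice((-1.0, 1.0))
        return played
    # Coordinate i is +1 with probability (1 + a[i]) / 2 and -1 otherwise.
    return [1.0 if rng.random() < (1.0 + ai) / 2.0 else -1.0 for ai in a]

def conditional_mean(a, gamma):
    """Exact E[played | a]; only the sign-sampling branch contributes."""
    return [(1.0 - gamma) * ((1.0 + ai) / 2.0 - (1.0 - ai) / 2.0) for ai in a]

a = [0.5, -0.2, 0.0]
gamma = 0.1
bias = max(abs(m - ai) for m, ai in zip(conditional_mean(a, gamma), a))
sample = perturb_hypercube(a, gamma, random.Random(0))
```

For this point the sup-norm bias equals $\gamma \max_i |a_t(i)|$, which is at most $\gamma$ as claimed in (3).
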
We can now prove the following result. 

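Theorem 5 Consider the online linear optimization problem with bandit feedback on
$A = \{x \in \mathbb{R}^d : \|x\|_\infty \le 1\}$, and with $Z = \{x \in \mathbb{R}^d : \|x\|_1 \le 1\}$. Then OSMD on $A$ with regularizer (2)
satisfies, for any $\eta$ and $\gamma \in (0, 1)$ such that $\frac{\eta d}{\gamma} \le \frac{1}{2}$,

$$R_n \le \gamma n + \frac{d \log 2}{\eta} + \eta \sum_{t=1}^n \sum_{i=1}^d E\left[ (1 - a_t(i)^2)\, \tilde{z}_t(i)^2 \right]. \quad (4)$$

In particular, with $\gamma = 2d \sqrt{\frac{\log 2}{3n}}$ and $\eta = \sqrt{\frac{\log 2}{3n}}$,

$$R_n \le 2d\sqrt{3n \log 2}. \quad (5)$$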

Remark that the regularizer (2) used here is in the class of Legendre functions with exchangeable
Hessian. More precisely, following Audibert et al. [2011], (2) can be written (up to a numerical
constant) as

$$F(x) = \sum_{i=1}^d \int_{-1}^{x_i} \tanh^{-1}(s)\, ds.$$

This type of regularizer was first studied (implicitly) by Audibert and Bubeck [2009] and
Audibert and Bubeck [2010].

Proof Since $F$ is Legendre on $A$, $F^*$ is differentiable on $\mathbb{R}^d$ and the gradient mapping of $F^*$ is
the inverse of the gradient mapping of $F$. Therefore, $(\nabla F^*)_i = \tanh$ because $(\nabla F)_i = \tanh^{-1}$.
Then, thanks to (3) and Theorem 2, the regret can be bounded as:

$$R_n \le \gamma n + \frac{\sup_{a \in A} F(a) - F(a_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n E\, D_{F^*}\!\left(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}\right).$$

For the first term it is easy to see that $F(a) - F(a_1) \le d \log 2$. For the term involving the Bregman
divergence, using elementary computations one obtains

$$D_{F^*}(u, v) = \sum_{i=1}^d \left( \log \frac{\cosh(u_i)}{\cosh(v_i)} - \tanh(v_i)(u_i - v_i) \right).$$

To prove (4) we need to show that $D_{F^*}(u, v) \le \sum_{i=1}^d (1 - \tanh^2(v_i))(u_i - v_i)^2$. In fact, we prove
that this inequality is true as soon as $\|u - v\|_\infty \le \frac{1}{2}$. The fact that the property is satisfied for the
pair $(u, v) = (-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1})$ under consideration is established at the very end of the proof.
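
The inequality between the Bregman divergence of $F^*$ and its quadratic upper bound can be probed numerically. The sketch below (plain Python) uses $F^*(x) = \sum_i \log\cosh(x_i)$, whose gradient is $\tanh$ coordinate-wise, and samples random pairs with $\|u - v\|_\infty \le \frac{1}{2}$. The sampling ranges, dimension, and seed are arbitrary choices made for this example, and such a check is of course no substitute for the proof.

```python
import math
import random

def bregman_dual(u, v):
    """D_{F*}(u, v) for F*(x) = sum_i log cosh(x_i), whose gradient is tanh."""
    return sum(
        math.log(math.cosh(ui) / math.cosh(vi)) - math.tanh(vi) * (ui - vi)
        for ui, vi in zip(u, v)
    )

def quadratic_bound(u, v):
    """sum_i (1 - tanh^2(v_i)) * (u_i - v_i)^2."""
    return sum((1.0 - math.tanh(vi) ** 2) * (ui - vi) ** 2 for ui, vi in zip(u, v))

rng = random.Random(0)
worst_gap = float("-inf")
for _ in range(2000):
    v = [rng.uniform(-3.0, 3.0) for _ in range(4)]
    u = [vi + rng.uniform(-0.5, 0.5) for vi in v]  # enforces ||u - v||_inf <= 1/2
    worst_gap = max(worst_gap, bregman_dual(u, v) - quadratic_bound(u, v))
```

On these samples the largest value of the divergence minus the quadratic bound stays nonpositive up to rounding error.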

Using a basic hyperbolic identity, and the elementary inequalities $\exp(x) \le 1 + x + x^2$ for all $x$ with
$|x| \le 1$ and $\log(1 + x) \le x$, one obtains

$$\begin{aligned}
& \log \frac{\cosh(u_i)}{\cosh(v_i)} - \tanh(v_i)(u_i - v_i) \\
&= \log \left( \frac{\cosh(v_i) \cosh(u_i - v_i) + \sinh(v_i) \sinh(u_i - v_i)}{\cosh(v_i)} \right) - \tanh(v_i)(u_i - v_i) \\
&= \log \big( \cosh(u_i - v_i) + \tanh(v_i) \sinh(u_i - v_i) \big) - \tanh(v_i)(u_i - v_i) \\
&= \log \left( \frac{1 + \tanh(v_i)}{2} \exp(u_i - v_i) + \frac{1 - \tanh(v_i)}{2} \exp(-(u_i - v_i)) \right) - \log \exp \big( \tanh(v_i)(u_i - v_i) \big) \\
&= \log \left( \frac{1 + \tanh(v_i)}{2} \exp \big( (1 - \tanh(v_i))(u_i - v_i) \big) + \frac{1 - \tanh(v_i)}{2} \exp \big( -(1 + \tanh(v_i))(u_i - v_i) \big) \right) \\
&\le \log \big( 1 + (1 - \tanh^2(v_i))(u_i - v_i)^2 \big) \le (1 - \tanh^2(v_i))(u_i - v_i)^2,
\end{aligned}$$

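which concludes the proof of (4). Now for the proof of (5) we first compute the matrix $P_t$:

$$\begin{aligned}
P_t = E\, \tilde{a}_t \tilde{a}_t^\top &= \frac{\gamma}{d} I_d + (1 - \gamma) \sum_{i,j} E[\xi_t(i) \xi_t(j)]\, e_i e_j^\top \\
&= \frac{\gamma}{d} I_d + (1 - \gamma) I_d + (1 - \gamma) \sum_{i \ne j} E[\xi_t(i) \xi_t(j)]\, e_i e_j^\top \\
&= \frac{\gamma}{d} I_d + (1 - \gamma) I_d + (1 - \gamma) \sum_{i \ne j} a_t(i) a_t(j)\, e_i e_j^\top \\
&= \frac{\gamma}{d} I_d + (1 - \gamma)\, a_t a_t^\top + (1 - \gamma) \sum_{i=1}^d (1 - a_t(i)^2)\, e_i e_i^\top.
\end{aligned}$$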
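
The last expression for $P_t$ is a diagonal matrix plus the rank-one term $(1 - \gamma) a_t a_t^\top$, so $P_t^{-1} x$ can be applied in time linear in $d$. The source does not say how the inverse is computed; the sketch below assumes the Sherman-Morrison formula as one standard option, in plain Python with hypothetical function names, and checks the result against the dense matrix.

```python
def build_P(a, gamma):
    """Dense P_t = (gamma/d) I + (1-gamma) a a^T + (1-gamma) diag(1 - a_i^2)."""
    d = len(a)
    P = [[(1.0 - gamma) * a[i] * a[j] for j in range(d)] for i in range(d)]
    for i in range(d):
        P[i][i] += gamma / d + (1.0 - gamma) * (1.0 - a[i] ** 2)
    return P

def apply_P_inverse(a, gamma, x):
    """Compute P_t^{-1} x in O(d) time via the Sherman-Morrison formula."""
    d = len(a)
    diag = [gamma / d + (1.0 - gamma) * (1.0 - ai ** 2) for ai in a]
    dinv_x = [xi / di for xi, di in zip(x, diag)]
    dinv_a = [ai / di for ai, di in zip(a, diag)]
    a_dinv_x = sum(ai * v for ai, v in zip(a, dinv_x))
    a_dinv_a = sum(ai * v for ai, v in zip(a, dinv_a))
    coef = (1.0 - gamma) * a_dinv_x / (1.0 + (1.0 - gamma) * a_dinv_a)
    return [v - coef * w for v, w in zip(dinv_x, dinv_a)]

a = [0.5, -0.2, 0.9]
gamma = 0.1
x = [1.0, -1.0, 1.0]
y = apply_P_inverse(a, gamma, x)
P = build_P(a, gamma)
# Largest coordinate of P y - x; it should vanish up to rounding.
residual = max(abs(sum(P[i][j] * y[j] for j in range(3)) - x[i]) for i in range(3))
```

The dense matrix is built here only to validate the fast routine; an implementation of the estimate in step (b) of Figure 2 would only need `apply_P_inverse`.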

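To obtain (5) first note that $(1 - \gamma) \sum_{i=1}^d E\left[ (1 - a_t(i)^2)\, \tilde{z}_t(i)^2 \right] \le E\, \tilde{z}_t^\top P_t \tilde{z}_t$. Now we use a spectral
decomposition of $P_t$ in an orthonormal basis and write: $P_t = \sum_{i=1}^d \lambda_i v_i v_i^\top$. In particular we have
$P_t^{-1} = \sum_{i=1}^d \frac{1}{\lambda_i} v_i v_i^\top$ and thus:

$$E\, \tilde{z}_t^\top P_t \tilde{z}_t = E\, (\tilde{a}_t^\top z_t)^2\, \tilde{a}_t^\top P_t^{-1} \tilde{a}_t \le E\, \tilde{a}_t^\top P_t^{-1} \tilde{a}_t = \sum_{i=1}^d \frac{1}{\lambda_i} E\, (v_i^\top \tilde{a}_t)^2 = \sum_{i=1}^d \frac{1}{\lambda_i} v_i^\top P_t v_i = d.$$

To conclude the proof it remains now to show that $\eta \|\tilde{z}_t\|_\infty \le \frac{1}{2}$. First note that the smallest
eigenvalue of $P_t$ is larger than $\gamma/d$, and thus:

$$\eta |\tilde{z}_t(i)| = \eta \left| e_i^\top P_t^{-1} \tilde{a}_t \tilde{a}_t^\top z_t \right| \le \eta \left| e_i^\top P_t^{-1} \tilde{a}_t \right| \le \frac{\eta d}{\gamma} \le \frac{1}{2},$$

where the penultimate inequality follows from $|e_i^\top \tilde{a}_t| \le 1$ and the last inequality follows from the
assumption on $\eta$ and $\gamma$. ■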



5 Improved regret for the Euclidean ball 

In this section we restrict our attention to the action set $A = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, where
$\|\cdot\|$ denotes the Euclidean norm. Using exp2 with John's exploration on a discretization of the
Euclidean ball one obtains a regret bound of order $d\sqrt{n \log n}$ for this problem. A similar regret
bound can be obtained with a computationally efficient algorithm, using the technique developed
by Abernethy et al. [2008]. Here we show that in fact one can attain efficiently a regret of order
$\sqrt{dn \log n}$ using OSMD with the appropriate regularizer $F$ and random perturbation $\tilde{a}_t$. More
precisely here we use $F(x) = -\log(1 - \|x\|) - \|x\|$ (the motivation for this particular regularizer
comes from the proof, see below). Moreover we perform the following perturbation of a point $a_t$
in the interior of $A$:

Let $\xi_t$ be a Bernoulli of parameter $\|a_t\|$, let $I_t$ be drawn uniformly at random in
$\{1, \dots, d\}$, and let $\varepsilon_t$ be Rademacher with parameter $\frac{1}{2}$. If $\xi_t = 1$, then play
$\tilde{a}_t = a_t / \|a_t\|$, else play $\tilde{a}_t = \varepsilon_t e_{I_t}$.

It is easy to check that this perturbation is unbiased, in the sense that $E[\tilde{a}_t \mid a_t] = a_t$. Here we
modify the estimate of step (b) in Figure 2, and instead we use:

$$\tilde{z}_t = (1 - \xi_t) \frac{d}{1 - \|a_t\|} (z_t^\top \tilde{a}_t)\, \tilde{a}_t. \quad (6)$$

It is easy to check that this estimator satisfies the same key unbiasedness property as the one in
step (b) in Figure 2, that is $E[\tilde{z}_t \mid a_t] = z_t$.
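
Both unbiasedness claims for the Euclidean ball can be verified by enumerating the finitely many outcomes of the perturbation. The sketch below is plain Python with hypothetical function names and an arbitrary example point; it assumes $\|a_t\| < 1$ so that the estimate (6) is well defined.

```python
import math

def ball_outcomes(a, z):
    """Enumerate (probability, played point, loss estimate) for a point a in the open ball."""
    d = len(a)
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    outcomes = []
    if norm_a > 0.0:
        # xi_t = 1: play the normalized point; estimate (6) is zero in this branch.
        outcomes.append((norm_a, [ai / norm_a for ai in a], [0.0] * d))
    for i in range(d):
        for sign in (-1.0, 1.0):
            played = [0.0] * d
            played[i] = sign
            observed = sign * z[i]  # bandit feedback: inner product of z and the played point
            scale = d / (1.0 - norm_a) * observed
            estimate = [scale * pj for pj in played]
            outcomes.append(((1.0 - norm_a) / (2.0 * d), played, estimate))
    return outcomes

def mean(outcomes, index, d):
    """Probability-weighted average of the played point (index 1) or estimate (index 2)."""
    return [sum(o[0] * o[index][j] for o in outcomes) for j in range(d)]

a = [0.3, -0.4, 0.0]
z = [0.5, 0.1, -0.2]
outs = ball_outcomes(a, z)
mean_play = mean(outs, 1, 3)
mean_estimate = mean(outs, 2, 3)
```

Up to rounding, `mean_play` recovers `a` and `mean_estimate` recovers `z`, matching the two conditional expectations stated above.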

Note that the problem studied in this section was also specifically considered in Abernethy and Rakhlin
[2009], with an emphasis on high probability bounds. In this paper the authors used the
self-concordant barrier $F(x) = -\log(1 - \|x\|^2)$ with a similar perturbation scheme to the one proposed
above. They obtain suboptimal rates, but a more careful analysis (precisely, slightly modifying
Section V.B., step (E)) can actually yield the same rate as the one we obtain. The strength of our
approach is that it is in a sense more elementary (e.g., we do not require any results from the
Interior Point Methods literature), but on the other hand the result of Abernethy and Rakhlin [2009]
holds with high probability (though it is not clear if it is possible to get the rate $\sqrt{dn \log n}$ with high
probability).

Theorem 6 Consider the online linear optimization problem with bandit feedback on
$A = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, and with $Z = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. Then OSMD on
$A' = \{x \in \mathbb{R}^d : \|x\| \le 1 - \gamma\}$ with the estimate (6), and $F(x) = -\log(1 - \|x\|) - \|x\|$ satisfies,
for any $\eta$ such that $\eta d \le \frac{1}{2}$,

$$R_n \le \gamma n + \frac{\log \gamma^{-1}}{\eta} + \eta \sum_{t=1}^n E\left[ (1 - \|a_t\|)\, \|\tilde{z}_t\|^2 \right]. \quad (7)$$

In particular, with $\gamma = \frac{1}{\sqrt{n}}$ and $\eta = \sqrt{\frac{\log n}{2nd}}$,

$$R_n \le 3\sqrt{dn \log n}. \quad (8)$$

Proof First, it is clear that by playing on $A'$ instead of $A$, one incurs an extra $\gamma n$ regret. Second,
note that $F$ is strictly convex (it is the composition of a convex and nondecreasing function with
the Euclidean norm), differentiable, and

$$\nabla F(x) = \frac{x}{1 - \|x\|}. \quad (9)$$

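In particular $F$ is Legendre on $\{x \in \mathbb{R}^d : \|x\| < 1\}$, and thus $F^*$ is differentiable on $\mathbb{R}^d$. Now
the regret with respect to $A'$ can be bounded as follows, thanks to Theorem 2,

$$\frac{\sup_{a \in A'} F(a) - F(a_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n E\, D_{F^*}\!\left(-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1}\right).$$

The first term is clearly bounded by $\frac{1}{\eta} \log \frac{1}{\gamma}$ (we use the fact that $a_1 = 0$). For the second term we
need to do a few computations (the first one follows from (9) and the fact that $F$ is Legendre):

$$\nabla F^*(u) = \frac{u}{1 + \|u\|},$$

$$F^*(u) = -\log(1 + \|u\|) + \|u\|,$$

$$D_{F^*}(u, v) = \frac{1}{1 + \|v\|} \left( \|u\| - \|v\| + \|u\| \cdot \|v\| - v^\top u - (1 + \|v\|) \log\left( 1 + \frac{\|u\| - \|v\|}{1 + \|v\|} \right) \right).$$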

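Let $\Theta(u, v)$ be such that $D_{F^*}(u, v) = \frac{1}{1 + \|v\|} \Theta(u, v)$. First note that

$$\frac{1}{1 + \|\nabla F(a_t)\|} = 1 - \|a_t\|. \quad (10)$$

Thus, in order to prove (7) it remains to show that $\Theta(u, v) \le \|u - v\|^2$, for $(u, v) = (-\eta \tilde{z}_{1:t}, -\eta \tilde{z}_{1:t-1})$.
In fact we shall prove that this inequality holds true as soon as $\frac{\|u\| - \|v\|}{1 + \|v\|} \ge -\frac{1}{2}$. This is the case for
the pair $(u, v)$ under consideration, since by the triangle inequality, equations (6) and (10), and the
assumption on $\eta$:

$$\frac{\|u\| - \|v\|}{1 + \|v\|} \ge -\frac{\eta \|\tilde{z}_t\|}{1 + \|v\|} \ge -\eta d \ge -\frac{1}{2}.$$

Now using that $\log(1 + x) \ge x - x^2$ for all $x \ge -\frac{1}{2}$, we obtain that for $u, v$ such that
$\frac{\|u\| - \|v\|}{1 + \|v\|} \ge -\frac{1}{2}$,

$$\begin{aligned}
\Theta(u, v) &\le \frac{(\|u\| - \|v\|)^2}{1 + \|v\|} + \|u\| \cdot \|v\| - v^\top u \\
&\le (\|u\| - \|v\|)^2 + \|u\| \cdot \|v\| - v^\top u \\
&= \|u\|^2 + \|v\|^2 - \|u\| \cdot \|v\| - v^\top u \\
&= \|u - v\|^2 + 2 v^\top u - \|u\| \cdot \|v\| - v^\top u \\
&\le \|u - v\|^2,
\end{aligned}$$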
which concludes the proof of (7). Now for the proof of (8) it suffices to note that:

$$E\left[ (1 - \|a_t\|)\, \|\tilde{z}_t\|^2 \mid a_t \right] = (1 - \|a_t\|) \sum_{i=1}^d \frac{1 - \|a_t\|}{d} \cdot \frac{d^2}{(1 - \|a_t\|)^2} (z_t^\top e_i)^2 = d \|z_t\|^2 \le d,$$

along with straightforward computations.